OpenAI released ChatGPT Images 2.0, built on the GPT Image 2 model. The core highlight is stronger reasoning, making the system work more like a deliberate creator: a new reasoning-and-planning feature performs online information retrieval and logical analysis before generating images, replacing the earlier 'blind box' style of image generation and improving the handling of complex visual tasks.
Google launched Gemini Embedding 2, the first multimodal embedding model built on the Gemini architecture, now in preview on the Gemini API and Vertex AI. It maps text, images, videos, audio, and documents such as PDFs into a single unified embedding space for cross-modal retrieval and classification, and supports more than 100 languages.
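As a sketch of how an embedding model like this is called through the Gemini API with the google-genai Python SDK: the model id below is a placeholder assumption, and only text inputs are shown, so treat this as the generic embed-and-rank pattern rather than confirmed API details.

```python
# Minimal sketch: embed documents and a query via the google-genai SDK,
# then rank by cosine similarity. The model id "gemini-embedding-2" is
# an ASSUMPTION for illustration, not a confirmed identifier.
import numpy as np
from google import genai

client = genai.Client()  # reads the API key from the environment

docs = ["a sunset over the ocean", "a stack trace from a Python crash"]
query = "debugging an exception"

resp = client.models.embed_content(model="gemini-embedding-2",
                                   contents=docs + [query])
vecs = np.array([e.values for e in resp.embeddings])
doc_vecs, q = vecs[:-1], vecs[-1]

# Cosine similarity between the query and each document embedding.
sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
print(docs[int(np.argmax(sims))])
```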
ByteDance's Volcano Engine will ship a technical upgrade on February 14th, centered on version 2.0 of the "Doubao" series, including the audio/video tool Seedance 2.0 and the image tool Seedream 5.0 Preview. Seedance 2.0 claims industry-leading interaction and image stability, supports full-modal input, and produces output that meets professional requirements for film and advertising. Seedream 5.0 introduces real-time information retrieval for the first time, keeping creative content in sync with current events.
Nano Banana 2 integrates Google's 4K AI image generation technology, supporting semantic retrieval and high-resolution output.
A multimodal information retrieval and reranking model supporting text, image, and video inputs.
A multilingual multimodal embedding model for text and image retrieval.
A multimodal embedding model enabling seamless retrieval of text, images, and screenshots.
| Vendor | Input ($/M tokens) | Output ($/M tokens) | Context length |
| --- | --- | --- | --- |
| Google | $0.49 | $2.1 | 1k |
| OpenAI | $2.8 | $11.2 | - |
| xAI | $1.4 | $3.5 | 2k |
| | $0.7 | $17.5 | - |
| Anthropic | $21 | $105 | 200 |
| Alibaba | $2 | $20 | - |
| | $1 | $10 | 256 |
| | $3.9 | $15.2 | 64 |
| ByteDance | $0.8 | - | 128 |
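Prices are dollars per million tokens, so the cost of a single request scales linearly with its token counts. A quick arithmetic sketch using the Google row above as the example rates:

```python
# Cost of one request at per-million-token rates (Google row above:
# $0.49 input, $2.1 output per million tokens).
def request_cost(input_tokens: int, output_tokens: int,
                 in_per_m: float, out_per_m: float) -> float:
    return input_tokens / 1e6 * in_per_m + output_tokens / 1e6 * out_per_m

# e.g. a 1,200-token prompt with a 300-token reply:
print(f"${request_cost(1200, 300, 0.49, 2.1):.6f}")  # $0.001218
```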
facebook
DINOv3 is a family of general-purpose vision foundation models developed by Meta AI that can outperform specialized state-of-the-art models on a range of visual tasks without fine-tuning. It uses a Vision Transformer architecture, is pre-trained on 1.689 billion web images, and produces high-quality dense features, performing strongly on image classification, segmentation, and retrieval.
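A sketch of how dense features are typically extracted from a ViT backbone like this with Hugging Face transformers; the checkpoint id is a placeholder assumption, and the code is the generic AutoModel pattern rather than an official DINOv3 recipe.

```python
# Sketch: dense patch features from a ViT backbone via transformers.
# The checkpoint id is a PLACEHOLDER, not a verified repository name.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

ckpt = "facebook/dinov3-vitb16"  # placeholder checkpoint id
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

image = Image.open("cat.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Token 0 is the [CLS] global descriptor; the remaining tokens are
# patch-level dense features (some checkpoints may also prepend
# register tokens that should be skipped).
cls_vec = out.last_hidden_state[:, 0]
patch_feats = out.last_hidden_state[:, 1:]
print(cls_vec.shape, patch_feats.shape)
```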
redlessone
DermLIP is a vision-language model specifically designed for the field of dermatology, trained on the largest dermatology image-text corpus, Derm1M. This model adopts a CLIP-style architecture and can perform various dermatology-related tasks, including zero-shot classification, few-shot learning, cross-modal retrieval, and concept annotation.
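DermLIP's CLIP-style design means it is used like any contrastive vision-language model. A generic zero-shot classification sketch with open_clip follows; the model and checkpoint ids and the label prompts are illustrative, not DermLIP's actual identifiers.

```python
# Sketch: CLIP-style zero-shot classification with open_clip.
# Model/checkpoint ids are GENERIC examples, not DermLIP's actual ids.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

labels = ["melanoma", "basal cell carcinoma", "benign nevus"]
image = preprocess(Image.open("lesion.jpg")).unsqueeze(0)
text = tokenizer([f"a dermoscopic image of {l}" for l in labels])

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    probs = (100 * img_feat @ txt_feat.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze(0).tolist())))
```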
NCSOFT
GME-VARCO-VISION-Embedding is a multimodal embedding model that focuses on calculating the semantic similarity between text, images, and videos in a high-dimensional embedding space, and is particularly good at video retrieval tasks.
chs20
FuseLIP is a multimodal embedding model that feeds both text tokens and discrete image tokens, drawn from a single extended vocabulary, through one Transformer, achieving early, deep fusion between modalities. This addresses the inability of traditional contrastive language-image pre-training models to natively encode mixed multimodal inputs, and the model performs well on tasks such as visual question answering and text-guided image retrieval.
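A minimal sketch of that early-fusion idea; all sizes, offsets, and layer counts here are illustrative assumptions, not FuseLIP's real configuration.

```python
# Sketch of early fusion: text tokens and discrete image tokens share
# ONE extended vocabulary and ONE Transformer. All hyperparameters are
# illustrative, not FuseLIP's actual configuration.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_CODES, DIM = 32_000, 8_192, 512
VOCAB = TEXT_VOCAB + IMAGE_CODES  # image codes live after the text ids

class EarlyFusionEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, text_ids, image_codes):
        # Offset image-tokenizer codes into the shared vocabulary, then
        # concatenate so one Transformer attends across both modalities.
        tokens = torch.cat([text_ids, image_codes + TEXT_VOCAB], dim=1)
        h = self.encoder(self.embed(tokens))
        return h.mean(dim=1)  # pooled multimodal embedding

enc = EarlyFusionEncoder()
text = torch.randint(0, TEXT_VOCAB, (2, 16))  # dummy text ids
img = torch.randint(0, IMAGE_CODES, (2, 64))  # dummy VQ image codes
print(enc(text, img).shape)  # torch.Size([2, 512])
```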
kshitij3188
PHOENIX is a domain-adaptive model based on CLIP/ViT, designed to enhance patent image retrieval capabilities, particularly suitable for retrieving semantically or hierarchically related images rather than exact matches.
nomic-ai
ColNomic Embed Multimodal 3B is a 3-billion-parameter multimodal embedding model specifically designed for visual document retrieval tasks, supporting unified encoding of multilingual text and images.
tsystems
A multilingual visual retrieval model based on Qwen2.5-VL-3B-Instruct and ColBERT strategy, supporting dynamic input image resolution and multilingual document retrieval.
A multilingual visual retrieval model based on Qwen2.5-VL-3B-Instruct and ColBERT strategy, supporting dynamic input image resolution and generating ColBERT-style multi-vector text and image representations.
vidore
A visual retrieval model based on Qwen2-VL-2B-Instruct and the ColBERT strategy, capable of generating multi-vector representations for text and images.
A visual retrieval model based on Qwen2.5-VL-3B-Instruct and ColBERT strategy, capable of generating multi-vector representations for text and images to enable efficient document retrieval.
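The ColBERT-strategy models listed above all score retrieval with late interaction: each query token embedding is compared against every document token (or image patch) embedding, the best match per query token is kept, and the maxima are summed. A small sketch of that MaxSim scoring rule, with shapes chosen for illustration:

```python
# Sketch of ColBERT-style late interaction (MaxSim) scoring.
# Shapes are illustrative; real models emit one vector per query
# token and per document token or image patch.
import torch
import torch.nn.functional as F

def maxsim_score(query: torch.Tensor, doc: torch.Tensor) -> torch.Tensor:
    """query: (Nq, D), doc: (Nd, D), both L2-normalized per vector."""
    sim = query @ doc.T                 # (Nq, Nd) token-level similarities
    return sim.max(dim=1).values.sum()  # best doc match per query token

q = F.normalize(torch.randn(12, 128), dim=-1)
docs = [F.normalize(torch.randn(n, 128), dim=-1)
        for n in (196, 256)]  # e.g. image-patch embeddings per page
scores = torch.stack([maxsim_score(q, d) for d in docs])
print(scores.argmax().item())  # index of the best-matching document
```

Keeping one vector per token lets fine-grained query terms match local regions of a page, which is why this family of models does well on visual document retrieval.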
gersonrpq
This model is trained on paintings captured during a two-hour museum tour and aims to improve painting feature extraction for image retrieval and zero-shot learning.
ModelsLab
This is a vision-language model based on the OpenCLIP framework, trained on the LAION-2B English subset, excelling in zero-shot image classification and cross-modal retrieval tasks.
recallapp
A vision-language model trained on the LAION-2B English dataset with the OpenCLIP framework, supporting zero-shot image classification and cross-modal retrieval.
JianLiao
A vision-language model fine-tuned from CLIP ViT-L/14 and optimized for abstract image-text retrieval tasks.
llm-jp
A Japanese CLIP model built with the OpenCLIP framework and trained on 1.45 billion Japanese image-text pairs, supporting zero-shot image classification and image-text retrieval tasks.
A visual retrieval model based on Qwen2-VL-2B-Instruct and the ColBERT strategy, capable of generating multi-vector text and image representations.
yydxlv
A visual retrieval model based on Qwen2-VL-7B-Instruct and the ColBERT strategy, supporting multi-vector text and image representations.
uta-smile
InstructCIR is an instruction-aware contrastive learning-based compositional image retrieval model, utilizing ViT-L-224 and Phi-3.5-Mini architectures, focusing on image-text-to-text generation tasks.
google
A multimodal model based on the SoViT backbone and trained with a sigmoid loss in place of the usual softmax contrastive loss, supporting zero-shot image classification and image-text retrieval.
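The sigmoid loss scores each image-text pair as an independent binary decision instead of normalizing over the whole batch with a softmax. A sketch of that computation, with dimensions and the temperature/bias values as illustrative assumptions:

```python
# Sketch of a SigLIP-style sigmoid contrastive loss: each image-text
# pair gets an independent binary label (+1 on the diagonal, -1 off
# it) rather than a batch-wide softmax. Init values are illustrative.
import torch
import torch.nn.functional as F

def sigmoid_loss(img: torch.Tensor, txt: torch.Tensor,
                 t: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.T * t.exp() + b         # (B, B) pair logits
    labels = 2 * torch.eye(img.size(0)) - 1    # +1 matched, -1 otherwise
    return -F.logsigmoid(labels * logits).mean()

B, D = 8, 512
img, txt = torch.randn(B, D), torch.randn(B, D)
t, b = torch.tensor(2.3), torch.tensor(-10.0)  # learnable in training
print(sigmoid_loss(img, txt, t, b))
```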
openbmb
VisRAG is a retrieval-augmented generation (RAG) system built on vision-language models (VLMs) that embeds document pages directly as images, avoiding the information loss caused by traditional text parsing.
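In outline, the flow is: pre-embed each page as an image, retrieve the nearest pages for a query, and pass the raw page images to a generative VLM. The sketch below uses dummy stand-ins for the embedder and the VLM; only the data flow reflects the description above.

```python
# Pipeline sketch of vision-based RAG. The embedder and VLM are DUMMY
# STUBS standing in for real models; the point is the data flow.
import numpy as np

rng = np.random.default_rng(0)

def embed_image(page) -> np.ndarray:        # stub page-image embedder
    return rng.normal(size=128)

def embed_text(query: str) -> np.ndarray:   # stub query embedder
    return rng.normal(size=128)

def vlm_generate(prompt: str, images: list) -> str:  # stub generator
    return f"answer to {prompt!r} grounded in {len(images)} page image(s)"

pages = ["page1.png", "page2.png", "page3.png"]
page_vecs = np.stack([embed_image(p) for p in pages])  # index pages as images

def answer(query: str, k: int = 2) -> str:
    q = embed_text(query)
    top = np.argsort(-(page_vecs @ q))[:k]
    # No OCR or layout parsing: raw page images go straight to the VLM.
    return vlm_generate(query, [pages[i] for i in top])

print(answer("What was Q3 revenue?"))
```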
The AWS MCP Servers are a set of dedicated servers based on the Model Context Protocol, offering various AWS-related functions, including document retrieval, knowledge base query, CDK best practices, cost analysis, image generation, etc., aiming to enhance the integration of AI applications with AWS services through a standardized protocol.
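As a sketch of how any such server is consumed, here is a minimal stdio client using the official MCP Python SDK; the server command and arguments are placeholders, not the AWS servers' actual launch command.

```python
# Sketch: connecting to an MCP server over stdio with the official
# Python SDK ("mcp" package). The command/args are PLACEHOLDERS, not
# the actual launch command for the AWS MCP Servers.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    params = StdioServerParameters(command="uvx", args=["some-mcp-server"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()  # discover server tools
            print([t.name for t in tools.tools])

asyncio.run(main())
```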
An MCP server based on TypeScript that provides Gyazo image integration services, supporting image search, retrieval, upload, and metadata access functions.
The Groundlight MCP server is a service for creating and managing image detectors. It supports multiple detection modes, including binary classification, multi-class classification, and counting functions, and provides interfaces for image queries and result retrieval.
This is a collection of MCP servers focused on the medical field, covering implementations such as PubMed literature retrieval, access to medical preprints, FHIR data interaction, DICOM medical image processing, protein structure analysis, medical computing tools, and integration of medical education resources.
An MCP server that provides image retrieval and processing functions, supporting loading images from URLs, local paths, and numpy arrays, and returning base64-encoded strings and MIME types.
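A sketch of how a tool like that might be written with the MCP Python SDK's FastMCP helper; this is a simplification that handles only local paths, not the URL and numpy-array inputs the project describes.

```python
# Sketch of an MCP tool returning base64 image data plus a MIME type,
# via the Python SDK's FastMCP helper. SIMPLIFICATION: local paths
# only; the actual project also supports URLs and numpy arrays.
import base64
import mimetypes
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("image-server")

@mcp.tool()
def fetch_image(path: str) -> dict:
    """Load a local image, returning base64 data and its MIME type."""
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("ascii")
    return {"mime_type": mime or "application/octet-stream", "data": data}

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```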
An MCP service for Unsplash image search and retrieval implemented in Go, providing functions such as keyword search, random image retrieval, and detailed image information query, supporting multiple connection modes and rich filtering options.
A TypeScript-based Gyazo image integration MCP server that provides image search, retrieval, and upload functions, supporting access to image resources and metadata via URI.
This project is a series of MCP servers based on SerpAPI and YouTube, providing AI assistants with various search functions, including Google Search, News, Scholar, Trends, Finance, Maps, Images, as well as YouTube Search and caption retrieval.
A Go-based Unsplash image search MCP service that provides image search, detail retrieval, and random image selection, supporting multiple connection modes and advanced filtering conditions.
A server based on the Model Context Protocol that provides interaction with the Magic: The Gathering Chinese card database, including card lookup, set retrieval, and creative card-text image generation, among other features.
An MCP server for searching and obtaining images from Wikimedia Commons, providing image search and metadata retrieval functions, especially suitable for scenarios where compliant use of public domain images is required.